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Abstract 



We consider the problem of learning a structured multi-task regression, where the output con- 
sists of multiple responses that are related by a graph and the correlated response variables are 
>■ ' dependent on the common inputs in a sparse but synergistic manner. Previous methods such as 

^ ■ £i/£2-regularized multi-task regression assume that all of the output variables are equally related 

l/-^ . to the inputs, although in many real-world problems, outputs are related in a complex manner. 

^ ! In this paper, we propose graph-guided fused lasso (GFlasso) for structured multi-task regression 

•£^ I that exploits the graph structure over the output variables. We introduce a novel penalty function 

(^ ' based on fusion penalty to encourage highly correlated outputs to share a common set of relevant 

inputs. In addition, we propose a simple yet efficient proximal-gradient method for optimizing 
GFlasso that can also be applied to any optimization problems with a convex smooth loss and the 
K> , general class of fusion penalty defined on arbitrary graph structures. By exploiting the structure 

H I of the non-smooth "fusion penalty", our method achieves a faster convergence rate than the stan- 

dard first-order method, sub-gradient method, and is significantly more scalable than the widely 
adopted second-order cone-programming and quadratic -programming formulations. In addition, 
we provide an analysis of the consistency property of the GFlasso model. Experimental results not 
only demonstrate the superiority of GFlasso over the standard lasso but also show the efficiency 
and scalability of our proximal-gradient method. 
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1 Introduction 

In multi-task learning, we are interested in learning multiple related tasks jointly by analyzing data 
from all of the tasks at the same time instead of considering each task individually [jlTl l22l [TSl . 
When data are scarce, it is greatly advantageous to borrow the information in the data from other 
related tasks to learn each task more effectively. 

In this paper, we consider a multi-task regression problem, where each task is to learn a func- 
tional mapping from a high-dimensional input space to a continuous-valued output space and only 
a small number of input covariates are relevant to the output. Furthermore, we assume that the out- 
puts are related in a complex manner, and that this output structure is available as prior knowledge 
in the form of a graph. Given this setting, it is reasonable to believe that closely related outputs 
tend to share a common set of relevant inputs. Our goal is to recover this structured sparsity pattern 
in the regression coefficients shared across tasks related through a graph. 

When the tasks are assumed to be equally related to inputs without any structure, a mixed-norm 
regularization such as £1/^2- and £i/£oo-norms has been used to find inputs relevant to all of the 
outputs jointly |fT5l[T9l . However, in many real-world problems, some of the tasks are often more 
closely related and more likely to share common relevant covariates than other tasks. Thus, it is 
necessary to take into account the complex correlation structure in the outputs for a more effective 
multi-task learning. For example, in genetic association analysis, where the goal is to discover few 
genetic variants or single neucleotide polymorphisms (SNPs) out of millions of SNPs (inputs) that 
influence phenotypes (outputs) such as gene expression measurements [|23l , groups of genes in the 
same pathways are more likely to share common genetic variants affecting them than other genes. 
In a neuroscience application, an £i/£oo -regularized multi-task regression has been used to predict 
neural activities (outputs) in brain in response to words (inputs) IfTTI . Since neural activities in the 
brain are locally correlated, it is necessary to take into account this local correlation in different 
brain regions rather than assuming that all regions share a similar response as in [fTTll . A similar 
problem arises in stock prediction where some of the stock prices are more highly correlated than 
others O. 

The main contributions of this paper are two-fold. First, we propose a structured regularized- 
regression approach called graph-guided fused lasso (GFlasso) for sparse multi-task learning prob- 
lems and introduce a novel penalty function based on fusion penalty that encourages tasks related 
according to the graph to have a similar sparsity pattern. Second, we propose an efficient optimiza- 
tion algorithm based on proximal-gradient method that can be used to solve GFlasso optimization 
as well as any optimizations with a convex smooth loss and general fusion penalty which can be 
defined on an arbitrary graph structure. 

In addition to the standard lasso penalty for overall sparsity fTT|, GFlasso employs a "fusion 
penalty" [18] that fuses regression coefficients across correlated outputs, using the weighted con- 
nectivity of the output graph as a guide. The overall effect of the GFlasso penalty is that it allows us 
to identify a parsimonious set of input factors relevant to dense subgraphs of outputs as illustrated 
in Figure [TJ To the best of our knowledge, this work is the first to consider the graph structure over 
the outputs in multi-task learning. We also provide an analysis of the consistency property of the 
GFlasso model. 

The fusion penalty that we adopt to construct the GFlasso penalty has been widely used for 
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Figure 1: Illustrations of multi-task regression with (a) lasso, (b) graph-guided fused lasso. 




sparse learning problems, including fused lasso p8l|, fused-lasso signal approximator [5J, and 
network learning ifTOl . However, because of the non-separability of the fusion penalty function, 
developing a fast optimization algorithm has remained a challenge. The available optimization 
methods include formulating the problem as second-order cone programming (SOCP) or quadratic 
programming (QP) and solving them by interior-point methods (IPM) [il8iK but these approaches 
are computationally expensive even for problems of moderate size. In order to reduce the compu- 
tational cost, when the fusion penalty is defined on a chain or two-way grid structure over inputs, 
the pathwise coordinate optimization has been applied ^. However, when the fusion penalty is 
defined on a general graph, this method cannot be easily applied because of the relatively complex 
sub-gradient representation. In addition, as pointed out in (lW\, this method "is not guaranteed 
to yield exact solution" when the design matrix is not orthogonal. Very recently, an unpublished 
manuscript ^ proposed a different algorithm which reformulates the problem as a maximum flow 
problem. However, this algorithm works only when the dimension is less than the sample size 
and hence is not applicable to high-dimensional sparse learning problems. In addition, it lacks any 
theoretical guarantee of the convergence. 

In order to make the optimization efficient and scalable, a first-order method (using only gra- 
dient) is desired. In this paper, we propose a proximal-gradient method to solve GFlasso. More 
precisely, by exploiting the structure of the non-smooth fusion penalty, we introduce its smooth 
approximation and optimize this approximation based on the accelerated gradient method in fJM . 
It can be shown that our method achieves the convergence rate of O(-), where the e is the desired 
accuracy. Our method is significantly faster than the most natural first-order method, subgradient 
method [4J with O(^) convergence rate and more scalable by orders of magnitude than IPM for 
SOCP and QP. We emphasize that although we present this algorithm for GFlasso, it can also be 
applied to solve any regression problems involving a fusion penalty such as fused lasso with a 
univariate response, where the fusion penalty can be defined on any structures such as chain, grid, 
or graphs. In addition, it is easy to implement with only a few lines of MATLAB code. 

The rest of the paper is organized as follows. In Section 2, we briefly review lasso and £1/^2- 
regularized regression for sparse multi-task learning. In Section 3, we present GFlasso. In Section 
4, we present our proximal-gradient optimization method for GFlasso and other related optimiza- 
tion problems, and discuss its convergence rate and complexity analysis. In Section 5, we present 
the preliminary consistency result of GFlasso. In Section 6, we demonstrate the performance of 
the proposed method on simulated and asthma datasets, followed by conclusions in Section 7. 



2 Preliminary: li- and ^1/^2-Reguliarized Multi-task Regres- 
sion 

Assume a sample of N instances, each represented by a J-dimensional input vector and a K- 
dimensional output vector. Let X = (xi, . . . , xj) G R^^"' denote the input matrix, and let Y = 
(yi, . . . , yx) G M^^^ represent the output matrix. For each of the K output variables (so called 
tasks), we assume a linear model: 

yfc = X/3, + efc, \Jk = l,...,K, (1) 

where /3^ = (/^i^, . . . , (5jkY ^ J^'^ is the vector of regression coefficients for the /c-th output 
variable, and e^ is a vector of A^ independent zero-mean Gaussian noise. We center the y^'s and 
Xj's such that X]j=i Vik = ^^^ X]j=i ^«i — 0' ^^^ consider the model without an intercept. 

Let B = (/3i, . . . , f3x) denote the J x K matrix of regression coefficients of all K response 
variables. Then, lasso ifTTll obtains B''"^'*" by solving the following optimization problem: 

giasso ^ argmin-||Y - XB||^ + A||B||i, (2) 

B 2 

where || ■ \\p denotes the matrix Frobenius norm, || ■ ||i denotes the the entry-wise matrix £i-norm 
(i.e., ||B||i = J2k=iJ2j=i |/3jfc|) ^iid Ais aregularizationparameterthat controls the sparsity level. 
We note that lasso in ^ does not offer any mechanism for a joint estimation of the parameters for 
the multiple outputs. 

Recently, a mixed-norm (e.g., £1/^2) regularization has been used for a recovery of joint spar- 
sity across multiple tasks, when the tasks share the common set of relevant covariates [3^.T31. More 
precisely, it encourages the relevant covariates to be shared across output variables and finds esti- 
mates in which only few covariates have non-zero regression coefficients for one or more of the K 
output variables. The corresponding optimization problem is given as follows: 

B'^/'^ = argmin-||Y - XB||| + A||B||i,2, (3) 

B 2 

where ||B||i2 = X]i=i ll/^'^lh^ f^'' is the j-th row of the regression coefficient matrix B, and || • II2 
denotes the vector £2-norm. Although the mixed-norm allows information to be combined across 
output variables, it assumes all of the tasks are equally related to inputs, and cannot incorporate a 
complex structure in how the outputs themselves are correlated. 

3 Graph-guided Fused Lasso for Sparse Structured Multitask 
Regression 

In this section, we propose GFlasso that explicitly takes into account the complex dependency 
structure in the output variables represented as a graph while estimating the regression coefficients. 
We assume that the output structure of the K output variables is available as a graph G with a set 



of nodes V^ = {1, . . . , K} and edges E. In this paper, we adopt a simple strategy for constructing 
such graphs by computing pairwise correlations based on y^'s, and connecting two nodes with an 
edge if their correlation is above a given threshold p. More sophisticated methods can be easily 
employed, but they are not the focus of this paper. Let Vmi G M denote the weight (can be either 
positive or negative) of an edge e = (m, I) E E that represents the strength of correlation between 
the two nodes. Here, we simply adopt the Pearson's correlation between y^ and y; as r mi- 
Given the graph G, it is reasonable to assume that if two output variables are connected with an 
edge in the graph, they tend to be influenced by the same set of covariates with similar strength. In 
addition, we assume that the edge weights in the graph G contain information on how strongly the 
two output variables are related and thus share relevant covariates. GFlasso employs an additional 
constraint over the standard lasso by fusing the (3jm and (3ji if (m, I) E E as follows: 

1 -^ 

B°P = mm/(B)^-||Y-XB||| + A||B||i+7 Y. ^(^-0 E 1^^- " ^^g"(^-')^^-'l' ^4) 

e={m,l)eE i=l 

where A and 7 are regularization parameters that control the complexity of the model. A larger 
value for 7 leads to a greater fusion effect. In this paper, we consider T{r) = |r|, but any positive 
monotonically increasing function of the absolute value of correlations can be used. The T{rmi) 
weights the fusion penalty for each edge such that I3jm and I3ji for highly correlated outputs with 
large \rmi \ receive a greater fusion effect than other pairs of outputs with weaker correlations. The 
sign(rm;) indicates that two negatively correlated outputs are encouraged to have the same set of 
relevant covariates with the same absolute value of regression coefficients of the opposite sign. 

When the edge-level fusion penalty is applied to all of the edges in the entire graph G in the 
GFlasso penalty, the overall effect is that each subset of output variables within a densely connected 
subgraph tends to have common relevant covariates. This is because the fusion effect propagates 
through the neighboring edges, fusing the regression coefficients for each pair of outputs con- 
nected by an edge, where the amount of such propagation is determined by the level of local edge 
connectivities and edge weights. 

The idea of using a fusion penalty has been first proposed for the problem with a univari- 
ate response and high-dimensional covariates to fuse the regression coefficients of two adjacent 
covariates when the covariates are assumed to be ordered such as in time [18J. In GFlasso, we em- 
ploy a similar but more general strategy in a multiple-output regression in order to identify shared 
relevant covariates for related output variables. 

4 Proximal-Gradient Method for Optimization 

Although the optimization problem for GFlasso in ([4]) is convex, it is not trivial to optimize it 
because of the non-smooth penalty function. The problem can be formulated as either SOCP or 
QP using the similar strategy in ifTSl . The state-of-the-art approach for solving SOCP and QP 
are based on an IPM that requires solving a Newton system to find a search direction. Thus, it is 
computationally very expensive even for problems of moderate size. 



In this section, we propose a proximal-gradient method which utilizes only the first-order infor- 
mation with a fast convergence rate and low computation complexity per iteration. More precisely, 
we first reformulate the £i and fusion penalty altogether into a max problem over the auxiliary 
variables, and then introduce its smooth lower bound. Instead of optimizing the original penalty, 
we adopt the accelerated gradient descent method lfT4l to optimize the smooth lower bound. 

The approach of "proximal" method is quite general in that it optimizes a lower or upper bound 
of the original objective function, rather than optimize the objective function directly. This lower or 
upper bound has a simpler form that allows for an easy optimization. Recently, different variants of 
a proximal-gradient method have been applied to solve optimization problems with a convex loss 
and non-smooth penalty, including matrix completion |[9l and sparse signal reconstruction [16J. 
However, the non-smooth penalties in these works are the norm of the coefficient itself instead of 
the norm of a linear transformation of the coefficient. So they use a quadratic approximation to the 
smooth loss function while keeping the easily-handled non-smooth penalty in their original form. 
In contrast, in this work, we find a smooth lower bound of the complicated non-smooth penalty 
term. 

4.1 A Reformulation of the Non-smooth Penalty Term 

First, we rewrite the graph-guided fusion penalty function in (Hj), using a vertex-edge incident 
matrix H e R^""^^^, as follows: 

J 

Y^ rirrra) J] \(3jm - sign(r„a)/3j7l = l|Bi/||i, 

e=(m,l)£E j=l 

where H e M^^l^l is a variant vertex-edge incident matrix defined as below: 

T{rmi) if e = (m, /) and k = m 

Hk,e = { -sign{rmi)r{rrni) if e = (m, /) and A; = / 
otherwise. 

Therefore, the overall penalty in dH) including both lasso and graph-guided fusion penalty functions 
can be written as ||BC||i, where C = (A/, ^H) and / G M^^^ denotes an identity matrix. 

Since the dual norm of the entry-wise matrix £00 norm is the £1 norm, we can further rewrite 
the overall penalty as: 

||BC||i= max (A,BC), (5) 

l|A||oo<l 

where (U, V) = Tr(U^V) denotes a matrix inner product, A E Q = {A|||A||oo < 1,A G 
]^Jx{K+\E\)^ is an auxiliary matrix associated with ||BC||i, and || • ||oo is the matrix entry-wise £00 
norm, defined as the maximum absolute value of all entries in the matrix. 

According to ([5]), the penalty term can be viewed as the inner product of the auxiliary matrix 
A and the linear mapping of B, r(B) = BC, where the linear operator F is a mapping from 
W^^ into RJxiK+\E\)_ By the fact that (A,F(B)) = Tr(A^BC) = Tr(CA^B) = (AC^,B), 
the adjoint operator of F is F*(A) = AC^ that maps M-^x(^+l^l) back into R-^^-^. Essentially, the 



adjoint operator T* is the linear operator induced by T defined in the space of auxiliary variables. 
The use of the linear mapping T and its adjoint T* will simplify our notation and provide a more 
consistent way to present our key theorem as shown in the next section. 

4.2 Proximal-Gradient Method 

The formulation of the penalty in ^ is still a non-smooth function in B, and this makes the 
optimization still challenging. To tackle this problem, we introduce an auxiliary strongly convex 
function to construct a smooth approximation of ([5]). More precisely, we define: 

/^(B)= max (A,BC)-MA), (6) 

||A||oo<l 

where fi is a positive smoothness parameter and d(A) is an arbitrary smooth strongly-convex 
function defined on Q. The original penalty term can be viewed as /^(B) with yU = 0, i.e. 
/o(B) = max||A||^<i(A, BC). Since our algorithm will utilize the optimal solution A* to ^, 
we choose d{A) = 1 1| A|||, so that we can obtain the closed-form solution for A*. 

It can be easily seen that //^(B) is a lower bound of /o(B). To bound the gap between them, let 

D= max d{A) = h\A\\l = l-J{K+\E\). (7) 

||Aj|oo<i 2, Z 

Then, we have /o(B) — /^(B) < fiD = jjiJ{K + \E\)/2. From the key theorem we present below, 
we know that /^(B) is a smooth function for any yU > 0. Therefore, /^j(B) can be viewed as a 
smooth approximation of /o(B) with the maximum gap of fiD and the fi controls the gap between 
f^(B) and /o(B). As we discuss in the next section in more detail, if our desired accuracy is e, 
i.e., /(B*) — /(B*) < e, where B* is the approximate solution at the t-th iteration, and B* is 
the optimal solution to the objective function in ^, we should set /i = ^ to achieve the best 
convergence rate. 

Now, we present the key theorem to show that fi_i(B) is smooth and that V/^(B) is Lipschitz 
continuous. This theorem is also stated in [14J but without a detailed proof of the smoothness prop- 
erty and a derivation of the gradient. We provide a simple proof based on the Fenchel Conjugate 
and properties of subdifferential. The details of the proof are presented in Appendix. 

Theorem 1. For any ^ > 0, /^(B) is a convex and continuously dijferentiable function in B with 
the gradient: 

V^(B) = P(A*) = A*C^, (8) 

where T* is the adjoint operator ofV defined at the end ofSection \4.1\ A* is the optimal solution 
to ^. Furthermore, the gradient V/yj(B) is Lipschitz continuous with the Lipschitz constant L^ = 
-lirp, where \\T\\ is the norm of the linear operator T defined as \\T\\ = max||v||2<i ||r(V)||2. 

To compute the V/^(B) and L^^ in the above theorem, we need to know A* and ||r||. We 
present the closed-form expressions of A* and ||r|| in the following two lemmas. The proof of 
Lemma |2] is provided in Appendix. 



Lemma 1. Let A* be the optimal solution of ^: 

where S is the shrinkage operator defined as follows. For x G M, S{x) = x if —1 < x < 1, 
S{x) = 1 ifx > 1, and S{x) = —1 ifx < —1. For matrix A, S'(A) is defined as applying S on 
each and every entry of A. 

Proof. By taking the derivative of ^ over A and setting it to zeros, we obtain A = 5^. Then, we 
project this solution onto the Q and get the optimal solution A*. D 



Lemma 2. ||r|| is upper bounded by ||r||;7 = a/A^ + 27^ maxfegy dk, where 

dk= J2 (^(^^))' (9) 

eS:E s.t. e incident on k 

for k E V in graph G; and this bound is tight. 

Given the results in Theorem [H now we present our proximal- gradient method for GFlasso. 
We substitute the penalty term in ^ with its smooth approximation /^(B) and obtain the smooth 
optimization problem: 

min7(B) = ^||Y - XBg + /^(B). (10) 

B Z 

According to Theorem [H the gradient of /(B) is: 

V/(B) = X^(XB - Y) + A*C^. (11) 

Note that V/(B) is Lipschitz continuous with the Lipschitz constant L tightly upper bounded by 

Lu: 

L = Aniax(X X) + L^ < Ainax(X X) H = Lu , (12) 

/^ 

where Amax(X^X) is the largest eigenvalue of (X^X). 

Instead of optimizing the original function /(B) in (H)), we optimize /(B). Since /(B) is a 
smooth lower bound of /(B), we can adopt the accelerated gradient-descent method, so called 
Nesterov's method [fT4ll . to minimize smooth /(B) as shown in Algorithm[T] 

In contrast to the standard gradient-descent algorithm. Algorithm [T] involves the updating of 
three sequences {W*}, {B*} and {Z*}, where B* is obtained from the gradient-descent update 
based on W* with the stepsize j-; Z* is the weighted combination of all previous gradient infor- 
mation and W*+^ is the convex combination of B* and Z*. Intuitively, the reason why this method 
is superior to the standard gradient descent is that it utilizes all of the gradient information from 
the first step to the current one for each update, while the standard gradient-descent update is only 
based on the gradient information at the current step. 



Algorithm 1 Proximal-Gradient Method for GFlasso 



Input: X, Y, A, 7, graph structure G, desired accuracy e. 

Initialization: Construct C = (A/, 'jH); compute Lu according to (fT2l) : compute D in^ and set 

/i= 2^;setW0 = gM-^^-^; 

Iterate For t = 0, 1, 2, . . . until convergence of B*: 

1. Compute V7(W*) according to (fTH) . 

2. Perform the gradient descent step : B* = W* — -^V/(W*). 

3. SetZ* = -^E!=o¥v7(W0. 

4. SetW*+i = |±|B* + ^Z*. 

Output: B = B* 

4.3 Complexity 

Although we optimize the approximation function /, it still can be proven that the B obtained from 
Algorithm [His sufficiently close to the optimal solution B* to the original objective function in @). 
We present the convergence rate of Algorithm [Din the next theorem. 

Theorem 2. Let B* be the optimal solution to dH) and B* be the intermediate solution at the t-th 
iteration in AlgorithmUl If we require /(B*) — /(B*) < e and set yU = 1^, then the number of 
iterations t is upper bounded by: 




XTX) + ^i^^Lkl (13) 



where D and ||r|| are as in (|7]) and Lemma^respectively. 

The key idea behind the proof is to decompose /(B*) — /(B*) into 3 parts: (i) /(B*) — /(B*), 
(ii)/(B*)— /(B*),(iii)/(B*) — /(B*). (i) and (iii) can be bounded by the gap of the approximation 
fxD. Since / is a smooth function, we can bound (ii) by the accuracy bound when applying the 
accelerated gradient method to minimize smooth functions [14]. We obtain ([T3l) by balancing 
these three terms. The details of the proof are presented in Appendix. According to Theorem [2l 
Algorithm [U converges in 0{^^^) iterations, which is much faster than the subgradient method 
with the convergence rate of O(^). Note that the convergence rate of our method depends on D 
through the term \/2D, which again depends on the problem size with D = J{K + \E\)/2. 

Theorem [2] suggests that a good strategy for choosing the parameter /i in Algorithm [T] is to set 
yU = 2I), where D is determined by the problem size. Instead of fixing the value for e, we directly 
set /i or the ratio of e and Z^ to a constant, because this automatically has the effect of scaling e 
according to the problem size without affecting the quality of the solution. 



Assuming that we pre-compute and store X^X and X^Y with the time complexity of O ( ■PN+ 
JKN), the main computational cost is to calculate the gradient V/(Wt) with the time complexity 
of 0{.PK + J\E\) in each iteration. Note that the per-iteration complexity of our method is (i) in- 
dependent of sample size N, which can be very large for large-scale applications, and (ii) linear in 
the number of edges \E\, which can also be large in many cases. In comparison, the second-order 
method such as SOCP has a much higher complexity per iteration. According to [|T2l . SOCP costs 
0{J'^{K + \E\)'^(KN + JK + J\E\)) per iteration, thus cubic in the number of edges and linear in 
sample size. Moreover, each IPM iteration of SOCP requires significantly more memory to store 
the Newton linear system. 

4.4 Proximal-Gradient Method for General Fused Lasso 

The proximal-gradient method for GFlasso that we presented in the previous section can be easily 
adopted to efficiently solve any types of optimization problems with a smooth convex loss function 
and fusion penalties such as fused-lasso regression and fused-lasso signal approximator [[T8l [8l. 
We emphasize that our method can be applied to the fusion penalty defined on an arbitrary graph 
structure, while the widely adopted pathwise coordinate method is only known to be applied to the 
fusion penalty defined on special graph structures, i.e., chain and two-way grid. For example, the 
general fused lasso solves a univariate regression problem with a graph fusion penalty as follows: 

1 -^ 

/3 = argmin-||y-X/3||^ + A^|/3,|+7 J^ |/3™-A|, (14) 



/3 



^{m,l)&E 



where || ■ II2 denotes a vector ^2 -norm, y G M^ is the univariate response vector of length N, 
X G M^^"' is the input matrix, /3 G M-^ is the regression coefficient vector, and E is the edge 
set in graph G = (V, E) with V = {1, . . . , J}. Note that the fusion penalty defined on inputs 
ordered in time as a chain (i.e., XliCi \Pj+i ~ Pj\) IfTSlI is a special case of the penalty in (fT4l) . It is 
straightforward to apply our proximal-gradient method to solve (fT4l) with only a slight modification 
of the linear mapping T((3) = Cf3 and its adjoint r*(Q;) = C^ol. 

To the best of our knowledge, the only gradient-based method available for optimizing (fT4l) is 
the recent work in |[8l that adopts a path algorithm. However, this relatively complex algorithm 
works only for the case of A^ > J, and does not have any theoretical guarantees on convergence. 
In contrast, our method is more generally applicable, is simpler to implement, and has a faster 
convergence rate. 

5 Asymptotic Consistency Analysis 

It is possible to derive results on the asymptotic behavior of the GFlasso estimator that is analogous 
to the ones in lasso IfTTl and fused lasso [18] when J and K are fixed as A^ — t- 00. Assume B is 
the true coefficient matrix and Bjv is the estimator obtained by optimizing (H)). Also assume that 
the regularization parameter \n and '^n are functions of A^. We have the following statistical 
convergence result: 

10 
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Theorem 3. I/Xn/VN ^ Xq > 0, -/n/VN ^ -/q > and C = limiv^oo (^ E»^i 
non-singular, where Xj is the i-th row of^, then 

VNiBN - B) ^a argmin V(IJ), (15) 

where 

\/(U) = -2j2 UfeW + Y, ^ICuk + A^') Yl E K-'t^''^"(/5^-'t)^(/5^-fc ^ 0) + l%fcl^(/3.fe = 0)] 

A; k k j 

+Ai'^ 5^ r(r„0 J2 [u',,signiP'^^)IiP'^^ ^ 0) + |«;.J/(/3;, = 

e={m,l)&E j 



with u'j^ = Ujrn — sign{rm,i)uji and /3jg = /Sjm — sign{r-mi)Pji, and W has an A^(0, a C) distribu- 
tion. 

The proof of the theorem is provided in Appendix. 

6 Experiments 

In this section, we demonstrate the performance of GFLasso on both simulated and real data, and 
show the superiority of our proximal-gradient method (Prox-Grad) to the existing optimization 
methods. Based on our experience for a range of values for //, we use ji = 10^'^ in all of our 
experiments, since it provided us reasonably good approximation accuracies across problems of 
different scales. The regularization parameters A and 7 are chosen by cross-validation. The code 
is written in MATLAB and we terminate our optimization procedure when the relative changes in 
the objective is below 10"^. 

6.1 Simulation Study 

6.1.1 Performance of GFlasso on the Recovery of Sparsity 

We conduct a simulation study to evaluate the performance of the proposed GFlasso, and compare 
the results with those from lasso and £1/^2 -regularized multi-task regression. 

We simulate data using the following scenario analogous to genetic association mapping with 
-ft' = 10, J = 30 and A^ = 100. To simulate the input data, we use the genotypes of the 60 
individuals from the parents of the HapMap CEU panel, and generate genotypes for additional 40 
individuals by randomly mating the original 60 individuals. We generate the regression coefficients 
/3^,'s such that the output y^'s are correlated with a block-like structure in the correlation matrix. 
We first choose input-output pairs with non-zero regression coefficients as we describe below. We 
assume three groups of correlated output variables of sizes 3, 3, and 4. Three relevant inputs are 
randomly selected for the first group of outputs, and four relevant inputs are selected for each of the 
other two groups, so that the shared relevant inputs induce correlation among the outputs within 
each cluster. In addition, we assume another relevant input for outputs in both of the first two 
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Figure 2: Regression coefficients estimated by different methods based on a single simulated 

dataset. b = 0.8 and threshold p = 0.3 for the output correlation graph are used. Red pixels 
indicate large values, (a) The correlation coefficient matrix of phenotypes, (b) the edges of the 
phenotype correlation graph obtained at threshold 0.3 are shown as white pixels, (c) the true re- 
gression coefficients used in simulation. Absolute values of the estimated regression coefficients 
are shown for (d) lasso, (e) £i/£2-regularized multi-task regression, (f) GFlasso. Rows correspond 
to outputs and columns to inputs. 
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Figure 3: ROC curves for comparing different sparse regression methods with varying signal-to- 
noise ratios. The b is set to (a) 0.3, (b) 0.5, (c) 0.8, and (d) 1.0. 

clusters in order to model the situation of a higher-level correlation structure across two subgraphs. 
Finally, we assume one additional relevant input for all of the phenotypes. Given the sparsity 
pattern of B, we set all non-zero (3ij to a constant b to construct the true coefficient matrix B. Then, 
we simulate output data based on the linear-regression model with noise distributed as A^(0, 1) as 
in ([U), using the simulated genotypes as covariates. 

We select the values of the regularization parameters A and 7 by using (N — 30) samples out 
of the total A^ samples as a training set, and the remaining 30 samples as a validation set. Then, 
we use the entire dataset of size N to estimate the final regression coefficients given the selected 
regularization parameters. 

As an illustrative example, a graphical display of the estimated regression coefficients from 
different methods is shown in Figure [2l It is apparent from Figures [2td) and (e) that many false 
positives show up in the results of lasso and £i/£2-regularized multi-task regression. On the other 
hand, the results from GFlasso in Figure |2tf) show fewer false positives and reveal clear block 
structures. This experiment suggests that borrowing information across correlated outputs in the 
output graph as in GFlasso can significantly increase the power of discovering true relevant inputs. 

We systematically and quantitatively evaluate the performance of the various methods by com- 
puting sensitivity/specificity on the recovered sets of relevant inputs and prediction errors aver- 
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Figure 4: ROC curves for comparison of sparse regression methods with varying thresholds (p's) 
for output graph structures, (a) p=0.1, (b) p=0.3, (c) p=0.5, and (d) p=Q.l . We use h = 0.8 for 
signal-to-noise ratio. 
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Figure 5: Comparison of multi-task learning methods in terms of prediction error. The threshold 
p for the output correlation graph is (a) p=0.1, (b) p=0.3, (c) p=0.5, and (d) p=OJ. We use b = 0.8 
for signal-to-noise ratio. 

aged over 50 randomly generated datasets. We generate additional 50 individuals in each training 
dataset, and compute the prediction error for this test dataset. In order to examine how varying the 
signal-to-noise ratio affects the performances of the different methods, we simulate datasets with 
the non-zero elements of the regression coefficient matrix B set to b = 0.3, 0.5, 0.8, and 1.0, and 
compute the ROC curves as shown in Figure [3l A threshold of p=0. 1 is used to generate output 
correlation graphs in GFIasso. We find that GFIasso outperforms the other methods for all of the 
four chosen signal-to-noise ratios. 

Next, we examine the sensitivity of GFIasso to how the output correlation graph is generated, 
by varying the threshold p of edge weights from 0.1 to 0.3, 0.5 and 0.7. With lower values of p, 
more edges would be included in the graph, some of which represent only weak correlations. The 
purpose of this experiment is to see whether the performance of GFIasso is negatively affected by 
the presence of these weak and possibly spurious edges that are included due to noise rather than 
from a true correlation. The results are presented in FigurelH GFIasso exhibits a greater power than 
all other methods even at a low threshold p=0.1. As the threshold p increases, the inferred graph 
strucutre includes only those edges with significant correlations. When the threshold becomes even 
higher, e.g., p = 0.7, the number of edges in the graph becomes close to 0, effectively removing the 
fusion penalty. As a result, the performances of GFIasso approaches that of lasso, and the two ROC 
curves almost entirely overlap (Figure Sfd)). Overall, we conclude that when flexible structured 
methods such as GFIasso are used, taking into account the correlation structure in outputs improves 
the power of detecting true relevant inputs regardless of the values for p. In addition, once the graph 
contains edges that capture strong correlations, including more edges beyond this point by further 
lowering the threshold p does not significantly affect the performance of GFIasso. 
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Figure [5] shows the prediction errors using the models learned from the above experiments 
summarized in FigureHl It can be seen that GFlasso generally offers a better predictive power than 
other methods, except for the case where the set of edges for the graph becomes nearly empty due 
to the high correlation threshold p=0.7 (Figure Od)). In this case, all of methods perform similarly. 

6.1.2 Computation Time 

In this section, we compare the computation time of our Prox-Grad with those of SOCP and QP 
formulations for solving GFlasso using simulation data. We use SDPT3 package [20] to solve the 
SOCP formulation. For the QP formulation, we compare two packages, MOSEK [2J and CPLEX 
OJ, and choose CPLEX since it performes better in terms of computation time. The computation 
time is reported as the CPU time for one run on the entire training set using the best selected 
regularization parameters. 

To compare the scalability of Prox-Grad with those of SOCP and QP, we vary J, A^, K, p and 
present the computation time in seconds in log-scale in Figures [6ta)-(d), respectively. All of the 
experiments are performed on a PC with Intel Core 2 Quad Q6600 CPU 2.4GHz CPU and 4GB 
RAM. We point out that when we vary the threshold p for generating the output graph in Figure 
[6td), the increase of p decreases the number of edges \E\ and hence reduces the computation time. 
In Figure [6l for large values of J, A^, K and small values of p, we are unable to collect results for 
SOCP and QP, because they lead to out-of-memory errors due to the large storage requirement for 
solving the Newton linear system. 

In Figure [6l we find that Prox-Grad is substantially more efficient and can scale up to very 
high-dimensional and large-scale datasets. QP is more efficient than SOCP since it removes the 
non-smooth ii terms by introducing auxiliary variables for each ii term. In addition, we notice 
that the increase of A^ does not increase the computation time significantly. This is because A^ only 
affects the computation time of X^X and X^y, which can be pre-computed, and does not affect 
the time complexity for each iteration during optimization. This observation is consistent with our 
complexity analysis in Section |431 
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Figure 6: Comparisons of scalabilities of various optimization methods. For Prox-Grad, SCOP 
and QP, we (a) vary J from 50 to 500 with a step size of 50 and then from 1000 to 10, 000 with a 
step size of 1000, fixing A^ = 1000, K = 50 and p = 0.5, (b) vary A^ from 500 to 10000 with a step 
size of 500, fixing J = 100, i^ = 50 and p = 0.5, (c) vary K from 50 to 500 with a step size of 50 
and then from 1000 to 10, 000 with a step size of 1000, fixing A^ = 500, J = 100 and p = 0.5, and 
(d) vary p from 0.1 to 0.9 with a step size of 0.1, fixing A^ = 500, J = 100 and K = 50. Note that 
the y-axis denotes the computation time in seconds in log-scale. 
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6.2 Asthma Dataset 

We apply GFlasso to 34 genetic markers and 53 clinical phenotyes collected from 543 asthma pa- 
tients as a part of the Severe Asthma Research Program (SARP) lfT3l . and compare the results with 
the ones from lasso and £1/^2 -regularized regression. Figure |7t a) shows the correlation matrix of 
the phenotypes after reordering the variables using the agglomerative hierarchical clustering algo- 
rithm so that highly correlated phenotypes are clustered with a block structure along the diagonal. 
Using the threshold p = 0.7, we fit the standard lasso, £i/£2-regularized multi-task regression, 
and GFlasso, and show the estimated /3fc's in Figures |7];c)-(e), with rows and columns representing 
phenotypes and genotypes respectively. The phenotypes in rows are rearranged according to the 
ordering given by the agglomerative hierarchical clustering so that each row in Figures |7tc)-(e) is 
aligned with the phenotypes in the correlation matrix in Figure |7]; a). We can see that the vertical 
bars in the GFlasso estimate in Figure |7te) span the subset of highly correlated phenotypes that 
correspond to blocks in Figure |7]; a). This block structure is much weaker in the results from the 
lasso in FigureHc), and the blocks tend to span the entire set of phenotypes in the results from the 
£1/^2 -regularized multi-task regression in Figure jT^d). 




1^ 




(a) (b) (c) (d) (e) 

Figure 7: Results for the association analysis of the asthma dataset. (a) Phenotype correlation 
matrix, (b) Phenotype correlation matrix thresholded at p = 0.7. (c) lasso; (d) £i/£2-regularized 
multi-task regression, and (e) GFlasso. 



7 Conclusions 

In this paper, we discuss a new method called GFlasso for structured multi-task regression that ex- 
ploits the correlation information in the output variables during the estimation of regression coeffi- 
cients. GFlasso used a weighted graph structure as a guide to find the set of relevant covariates that 
jointly affect highly correlated outputs. In addition, we propose an efficient optimization algorithm 
based on a proximal-gradient method that can be used to solve GFlasso as well as any optimization 
problems involving a smooth convex loss and fusion penalty defined on arbitrary graph structures. 
Using simulated and asthma datasets, we demonstrate that including richer information about out- 
put structure as in GFlasso improves the performance for discovering true relevant inputs, and that 
our proximal- gradient method is orders-of-magnitude faster and more scalable than the standard 
optimization techniques such as QP and SOCP. 
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Appendix 

A. 1 Proof of Theorem [I 

The //i(B) is a convex function since it is the maximum of a set of functions linear in B. 

For the smoothness property, let the function d* be the Fenchel conjugate of the distance func- 
tion d which is defined as: 

d*(&)=max(A,@)-d(A). (16) 

AeQ 

We want to prove d* is differentiable everywhere by showing that the subdifferential dd* of d* is a 

singleton set for any 0. 

By the definition in (fT6l ). we have, for any and any AeQ: 

t/*(0) + rf(A)> (A,0), (17) 

and the inequality holds as an equality if and only if A = arg max^,gg(A', 0) — d(A'). 

By the fact that for a convex and smooth function, the conjugate of the conjugate of a function 
is the function itself (Chapter E in [7J), we have d** = d. Then, (flTI) can be written as: 

d*{e) + d**{A)>{A,&), (18) 

and the inequality holds as an equality if and only if = arg max^/^^j (A, 0') — d*{@') . 

Since (fTVl ) and (fTSi) are equivalent, we know that A = arg max^,gQ A'^0 — d(A') if and only 
if = arg maxQ/gjjjj (A, 0') — d* (0') . The latter equality implies that for any 0': 

d*{&') >rf*(0) + (A,0'-0), 

which further means that A is a subgradient of d* at by the definition of subgradient. 

Summarizing the above arguments, we conclude that A is a subgradient of d* at if and only 

if 

A = argmax(A',0) -rf(A'). (19) 

A'eQ 

Since d is a strongly convex function, this maximization problem in (fT9l ) has a unique optimal 

solution, which means the subdifferential dd* of d* at any point is a singleton set that contains 

only A. Therefore, d* is differentiable everywhere (Chapter D in fT\) and A is its gradient: 

Vd*{&) = A = argmax(A', 0) - d{A'). (20) 

A'eQ 

Now we return to our original problem of /^(B) and rewrite it as: 

/^(B) = max(A, r(B)) - /id(A) = /imax[(A, -^) - d{A)] = fid*{-^). 

Utilizing (l20l) and the chain rule, we know that /^(B) is continuously differentiable and its 
gradient takes the following form: 

V^(B) = ;,r(Vrf*(^^))=/iP(argmax[(A',^-rf(A')]) 

/^ A'eQ /i 

= r(argmax[(A',r(B)) -/xd(AO]) = r(A*). 
A'eQ 
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For the proof of Lipschitz constant of /^ (B) , readers can refer to [T?! . 

A.2 Proof of Lemma |2] 

According to the definition of ||r||, we have: 

llrll = max ||r(B)||^= max \\(XB,-fBH)\\F 

" " llTIill T llRll T 



= max JX^\\B\\% + -f^\\BH\\l= max a/A^ + 72||Bi/||2„ 

Therefore, to bound ||r||, we only need to find an upper bound for max||B||^=i ||Biy|||,. 
According to the formulation of matrix H, we have 

\\BH\\l= Yl ir{rmi)?Y.^(3,m-sign{r^i)f3,if (21) 

e=(m,l)&E j 

It is well known that (a — 6)^ < 2a^ + 26^ and the inequality holds as equality if and only if 
a = —b. Using this simple inequality, for each edge e = (m, /) G E, the summation J2ji(^jm — 
sign{rmi)l3jif is upper-bounded by T.ji'^Pjm + '^P'ji) = 2||/3„||i + 2||/3j||^. Here, the vectors f3^ 
and /3; are the m-th and /-th columns of B. The right-hand side of (f2T)) can be further bounded as: 



mmi < j:e=im,l)^E^irirml)n\\l3J\l + II Alii) 
— Z^fceyvA^e incident on fc ^(''"(^e)) jUPfelb 

= I]fcey24||/3fcll2) 

where dk is defined in Q. Note that the first inequality is tight, and that the first equality can be 
obtained simply by changing the order of summations. 

By the definition of Frobenius norm, ||B|||, = ^^ ll/^fclli- Hence, 

max ||Bi7||^ < max > 2(ifc||/3jt||2 = 2max(ifc, 

I|B|If=1 i|B|lF = l ■^""^ ^ 

II Hi- II IIJ- ^ 

where the maximum is achieved by setting the /3^ corresponding to the largest dk to be a unit vector 
and setting other /S^'s to be zero vectors. 

In summary, ||r|| can be tightly upper bounded as: 

||r|| = max||B||^=i ||r(B)||i7 



= max||B||j,=i ■s/X'^ + ^'^\\BH\\f 
= a/ A^ + 7^ max||B||j,= i ||Bg||jr 
< a/A^ + 272 maxfc 4 = \\T\\u. 
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A.3 Proof of Theorem H 

Based on Theorem 2 in [[T4|. we have the following lemma: 



Lemma 3. Assume that function /(B) is an arbitrary convex smooth function and its gradient 
V/(B) is Lipschitz continuous with the Lipschitz constant L that is further upper-bounded by Lu- 
Apply Algorithm 1 to minimize /(B) and let B* be the approximate solution at the t-th iteration. 
For any B, we have the following bound: 

7(B*) - 7(B) < ^^^I'f"^ (22) 

Based on Lemma[3l we present our proof. Recall that the smooth approximation of the function 

/(B), 7(B), is defined as: 

7(B) = -IIY - XBlll + L(B) = -IIY - XBlll + max(A, BC) - -IIAfp. 
Since Algorithm 1 optimizes the smooth function /(B), according to Lemma [3l we have 

7(B*) - 7(B*) < ^M^, (23) 



where Lu = Aniax(X'^X) + -^-^ is the upper bound of the Lipschitz constant for V/(B) . 
We want to utilize the bound in (l23l) : so we decompose /(B*) — /(B*) into three terms: 

/(B*) - /(B*) = (/(B*) - 7(B*)) + (7(B*) - 7(B*)) + (7(B*) - /(B*)) . (24) 

According to the definition of /, we know that for any B, 

7(B) < /(B) < 7(B) + /iD, 

where D = maxAes d{A). Therefore, the first term in (|24l) . /(B*) — /(B*), is upper-bounded by 
fiD; and the last term in (|24l) is less than or equal to 0, i.e. 7(B*) - /(B*) < 0. Combining ([23]) 
with these two simple bounds, we have: 

97'llR*l|2 9ll"R*l|2 /||Fl|2 \ 

/(B*) - /(B*) <f^D+ '-M^ <^D + ^^ ^^ + A_(X^X)J . (25) 
By setting 1^ = jq and plugging it into the right hand side of (|25l) . we obtain 

/(B*) - /(B*) < i + ^^ (^^Hli + A_ (X^X)) . (26) 

If we require the right-hand side of (l26l) to be equal to e and solve for t, we obtain the bound of t 
in([l3]). 

Note that we can set ^u = ^ for any h > Ito achieve O (i) convergence rate, which is different 
from (fT3l) only by a constant factor. 
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A.4 Proof of Theorem |3] 

Define ^^(U) by 

K N 

^iv(u) = E E [('^^ - ^l^^i^f - 4] + Aii^ E E 0/3.^ + «.Vv^i - \PA] 

fc=l J=l k j 

(m,0 J 



Note that Vn{u.) is minimized at v A^(BAr — B). Notice that we have 

K N 



E E h^ - -r-^/v^)' - ^l] ^^ -2 Et-^W + nlCn,], 

k=l i=l k 

^S^EE [\I^Jk+u,k/VN\ - |/3,fc|] ^, A^'^EE h^sign(/3,fc)J(/3,fc ^ 0) + |M,fc|/(/3,-fc = 

k j k j 

iS^ E /(^-oE0/3;e+«yViv|-i/3;,i] 

e={m,l)£E j 

-^d A?^ E /(^-') E KeSign(/3ie)/(/3;e 7^ 0) + ^el^l/^^e = 0)] • 
e=(m,l)eE j 

Thus, V/v(U) — T-,^ ^(U) with the finite-dimensional convergence holding trivially. Since Vn 
is convex and V has a unique minimum, it follows that argminu Vn{\J) = \/n(B — B) — >^ 
argminu ^(U). 
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